Published 2004-11-01 22:59:54

When John released his bindings to HTML Tidy, I joked with him that it would have been a far more interesting project to write a proper HTML lexer rather than bind to an existing library (mainly because, having written one in PHP, I didn't think it would be that difficult, and I have a strange idea of fun...).

Well, over the weekend I was re-pondering this, partly because I had used the Flexy parser to try to parse HTML from a web site and found that the tokenizer in Flexy was getting slower with age (5 seconds on average to parse a page). This is not normally a huge issue, as the parsing is cached during the compile phase of the template engine. It is a huge issue if you are pulling pages down, parsing out the forms, and reposting the forms in a web test script.

So over the weekend, after a little Google search-and-discover trip, I ran across a little W3C project, "A Lexical Analyzer for HTML and SGML". It looked interesting, but it wasn't until I pulled the code down, untarred and built it that I realized it could be used to write a really fast and simple HTML tokenizer. (Not only that, it could easily form the basis of a C-based backend for Flexy.)

Creating an extension that used the code (not as a library; I just pulled the C code into a PHP extension) and parsed a string of HTML took about 30 minutes. It took an extra 3 hours, on and off over a few days, to make it return an array of tokens (with attributes sorted into a sensible structure).

So now I have a cute extension that has 1 function and 1 result: KISS at its best.

<?php
print_r(
flexyparser_tokenize(
file_get_contents("..some file...")
));

Outputs:

[0] => Array
(
[0] => 14 // token type (look up the source)
[1] => // data (tag name or string)
[2] => 1 // line number
[3] => 0 // character position
)

[1] => Array
(
[0] => 1
[1] =>

[2] => 2
[3] => 50
)

[2] => Array
(
[0] => 2
[1] => HTML
[2] => 2
[3] => 51
)

[3] => Array
(
[0] => 2
[1] => HEAD
[2] => 2
[3] => 57
)
.....
......
[15] => Array
(
[0] => 2
[1] => A
[2] => 6
[3] => 212
[4] => Array // array of attributes
(
[HREF] => "/pub/WWW/Consortium/"
)

)

[16] => Array
(
[0] => 2
[1] => IMG
[2] => 7
[3] => 243
[4] => Array
(
[align] => bottom

[src] => "/pub/WWW/Icons/WWW/w3c_48x48"
)

)

The code is in my svn server, under akpear/flexyparser. It works perfectly with PHP5 and PHP4 at the moment.
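
As a rough sketch of how the token array might be consumed in a web test script (say, pulling out link targets), something like the following could work. The helper name collect_attribute is made up for illustration and is not part of the extension. Treating type 2 as a start tag is an assumption read off the sample output, not something checked against the extension source. The sample tokens are copied by hand from the output so the snippet runs without the extension installed.

```php
<?php
// Hypothetical helper (not part of the extension): collect one attribute
// from every start tag with a given name, using the token layout shown above:
// [0] type, [1] tag name or data, [2] line, [3] position, [4] attributes.
function collect_attribute(array $tokens, $tagName, $attrName)
{
    $startTag = 2; // ASSUMPTION: type 2 is a start tag, inferred from the sample output
    $found = array();
    foreach ($tokens as $token) {
        if ((int) $token[0] !== $startTag || strcasecmp($token[1], $tagName) !== 0) {
            continue; // not the tag we are looking for
        }
        if (!isset($token[4]) || !is_array($token[4])) {
            continue; // tag has no attributes
        }
        foreach ($token[4] as $name => $value) {
            // Attribute names keep their source case (HREF vs src), so compare loosely.
            if (strcasecmp((string) $name, $attrName) === 0) {
                // Values appear to keep their surrounding quotes, so strip them.
                $found[] = trim($value, "\"'");
            }
        }
    }
    return $found;
}

// Sample tokens copied from the output above.
$tokens = array(
    array(2, 'HTML', 2, 51),
    array(2, 'HEAD', 2, 57),
    array(2, 'A', 6, 212, array('HREF' => '"/pub/WWW/Consortium/"')),
    array(2, 'IMG', 7, 243, array(
        'align' => 'bottom',
        'src'   => '"/pub/WWW/Icons/WWW/w3c_48x48"',
    )),
);

print_r(collect_attribute($tokens, 'a', 'href'));
print_r(collect_attribute($tokens, 'img', 'src'));
```

In a real script the $tokens array would come from flexyparser_tokenize() instead of being typed in, and the same loop could look for FORM and INPUT tags rather than links.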

I really want to do a tree version of this that loads data into a user-defined object, e.g.:
<?php
$tree = flexyparser_toTree($data, new MyClass);

so it can be used 'how you want it...'



Comments

Interesting. Despite tidy and the PHP5 DOM (HTML support), a decent HTML parser is really needed.

How well does it cope with some of the harder stuff, like script tags where JavaScript itself writes HTML?
#0 - Harry Fuecks on 2004-11-02 02:17:52

No idea what state it is currently in, but did you see: http://pecl.php.net/package/html_parse
#1 - Jan Schneider on 2004-11-02 02:39:47
